DATA3888_2025 Final Group Report

Predicting volatility

Author

Published

June 1, 2025

Executive Summary

Traditional volatility forecasting models like EGARCH, while statistically rigorous, struggle with the speed, noise, and complexity of modern high-frequency markets. As trading demands real-time responsiveness, reliance on lagged prices misses dynamic signals in live order book data.

This project investigates whether machine learning models trained on high-frequency order book features can outperform traditional econometric approaches in short-term volatility forecasting. Using second-by-second data from anonymous stocks, we engineered domain-informed features—capturing momentum, spread, and latent liquidity—and trained models including LightGBM, XGBoost, Random Forest, and robust regression.

Results strongly favor machine learning: LightGBM reduced error by over \(99\%\) versus EGARCH while delivering sub-millisecond prediction speed, ideal for real-time trading. SHAP analysis identified instantaneous price shifts and liquidity-weighted volatility as key drivers.

These insights power EchoVol20, an interactive dashboard that forecasts volatility in real time. Users can upload order book data, visualize predictive features, and receive actionable volatility regime classifications. A built-in risk calculator transforms forecasts into position sizing guidance—bridging predictive power with practical trading decisions. This work affirms our hypothesis and delivers a deployable system embodying a fast, adaptive, real-time market risk forecasting paradigm.

Introduction

Volatility—the degree of variation in asset returns—is more than a statistical artefact; it underpins key decisions in derivative pricing, portfolio allocation, and risk management (Bhambu et al., 2025). Yet in modern financial markets, where price dynamics unfold in milliseconds, volatility remains notoriously difficult to forecast. The consequences of misjudging it are severe: in 2024, a sudden spike in the CBOE Volatility Index (VIX) triggered the collapse of multiple short-volatility ETFs (Exchange-Traded Funds), erasing over $4 billion in value within days (Mackenzie, 2024). Such episodes reveal the critical need for more responsive and accurate volatility forecasting tools.

Historically, volatility modelling has been dominated by the Generalised Autoregressive Conditional Heteroskedasticity (GARCH) models, which leverage patterns in historical return series. Extensions like the Exponential GARCH (EGARCH) improved upon this by accounting for the asymmetric impact of negative returns on future volatility (Jonathan, 2025). However, these models assume stationarity and rely on linear dynamics, limiting their effectiveness in high-frequency environments characterised by rapid regime shifts and nonlinear interactions (Mohammed & Mudhir, 2020).

The rise of high-frequency trading (HFT) has transformed volatility into a high-resolution challenge. Order books now stream real-time data on liquidity, price imbalances, and spread dynamics—rich, untapped signals that often precede volatility shocks (Ashwin et al., 2012). Machine learning (ML) models, with their ability to extract patterns from high-dimensional and noisy data, offer a promising alternative. But a key question remains:

Important

Research Question

Do machine learning models trained on high-frequency order book data provide more accurate and timely short-term volatility forecasts than traditional econometric approaches?

To answer this, we conduct a systematic comparative study of:

  • EGARCH: A benchmark econometric model designed to capture volatility asymmetries.

  • LightGBM and XGBoost: Gradient boosting frameworks capable of modelling complex nonlinearities and interactions in noisy, high-frequency data.

  • Other contenders: Including Random Forest and robust linear models such as Huber Regressor.

As markets become increasingly automated, accurate real-time volatility forecasting is critical not just for performance, but for resilience. This study bridges statistical theory and modern data-driven methods, revealing how high-frequency data and machine learning can reshape our understanding of market dynamics.

Method Overview

Our methodology, summarized in Figure 1, begins with the collection of order book data for 126 anonymized stocks. Stocks were segmented into high, low, and general volatility cohorts based on average realized volatility. From each group, representative stocks and 100 time_ids were sampled to construct a unified yet volatility-diverse training dataset.

Next, we performed domain-driven feature engineering tailored to the microstructure of limit order books. Core predictive signals such as the weighted average price (WAP), log returns, bid-ask spread deltas, volume-weighted volatility, and noise ratios were derived to capture transient liquidity stress and price momentum.

We then evaluated five model classes under controlled settings: EGARCH (as a traditional econometric benchmark), Huber Regressor, Random Forest, XGBoost, and LightGBM. Each model was assessed using rigorous metrics—RMSE, MAE, Mean Directional Accuracy (MDA), and prediction latency—to balance statistical precision with deployment viability.

After identifying LightGBM as the top-performing model in both accuracy and speed, we conducted targeted hyperparameter tuning using Optuna and feature selection based on SHAP values to refine predictive performance and model interpretability. The finalized model was then stress-tested on high- and low-volatility stock subsets to evaluate robustness and generalizability across market regimes.

Finally, the optimized model was integrated into a lightweight application interface designed for trading environments, enabling sub-millisecond predictions of short-term volatility.

Figure 1: Overview of the methodology pipeline, from order book data collection and feature engineering through model evaluation and tuning to app deployment.

Data Collection

Code
# import all libraries 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
import lightgbm as lgb

Data Overview

Primary Training Dataset: High-Frequency Anonymized Order Book Data

We use a proprietary dataset containing high-frequency order book snapshots from 126 anonymized stocks. Each record is grouped by a unique time_id, representing an unspecified time segment within a trading day. These time_ids are neither sequential nor linked to actual timestamps, preventing direct temporal aggregation for standard time-series modeling.

Within each time_id, bid-ask prices and volumes are recorded at a second-level resolution for up to 600 seconds (simulating a 10-minute trading window). However, gaps exist due to intermittent recording—typical in high-frequency systems.
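These gaps can be quantified per time_id before any imputation. The sketch below is a diagnostic helper of our own (the function name is illustrative); column names follow the schema used in the preprocessing code later.

Code
```python
import pandas as pd

def missing_seconds_per_time_id(df, horizon=600):
    # Number of the `horizon` seconds with no order book snapshot, per time_id
    return df.groupby('time_id')['seconds_in_bucket'].nunique().rsub(horizon)

# Toy example: one time_id with snapshots only at seconds 0, 1 and 5
toy = pd.DataFrame({'time_id': [1, 1, 1], 'seconds_in_bucket': [0, 1, 5]})
gaps = missing_seconds_per_time_id(toy)
```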

Due to the computational cost of processing full-depth order books across all stocks, we selected a representative random sample of five stocks. This mirrors real-world scenarios where models must generalize to new instruments with sparse historical data.


Volatility-Stratified Subsets for Regime Evaluation

To test model robustness under different market conditions, we constructed three volatility-based subsets, stratified using average realized volatility per time_id (see Feature Engineering):

  • High-Volatility Dataset df_high: Top 5 stocks by average volatility.

  • Low-Volatility Dataset df_low: Bottom 5 stocks by average volatility.

  • General Dataset df_general: Random selection of 5 stocks from neither extreme group, used as a baseline for model development.

Each subset contains 100 randomly sampled time_ids per stock, concatenated into one final labeled dataset using a common preprocessing function.

Reproducibility Note: Stock rankings and selection logic are implemented in stock_ranking.py.

Code
def generate_representative_dataset(stock_files, label, sample_size=100, seed=42):
    np.random.seed(seed)
    sampled_dfs = []

    for file in stock_files:
        stock_df = pd.read_csv(file)
        stock_name = file.split('.')[0]
        sampled_ids = np.random.choice(stock_df['time_id'].unique(), size=sample_size, replace=False)
        sampled_df = stock_df[stock_df['time_id'].isin(sampled_ids)].copy()
        sampled_df['stock_name'] = stock_name
        sampled_df['dataset_label'] = label
        sampled_dfs.append(sampled_df)

    return pd.concat(sampled_dfs, ignore_index=True)
  
general_stocks = ['stock_39.csv', 'stock_47.csv', 'stock_61.csv', 'stock_84.csv', 'stock_93.csv']
high_vol_stocks = ['stock_6.csv', 'stock_27.csv', 'stock_75.csv', 'stock_80.csv', 'stock_18.csv']
low_vol_stocks = ['stock_41.csv', 'stock_43.csv', 'stock_29.csv', 'stock_46.csv', 'stock_125.csv']

# Uncomment to run function and generate data subsets 

# df_general = generate_representative_dataset(general_stocks, label='general') 
# df_high = generate_representative_dataset(high_vol_stocks, label='high')
# df_low = generate_representative_dataset(low_vol_stocks, label='low')

Additional Dataset for App Deployment

For real-world deployment, we integrated a complementary dataset provided by Optiver. This includes real stocks such as Apple and Amazon, along with identifiable tickers and sequential time_ids mapped to real timestamps. This dataset was used in the interactive application to simulate predictions in a production-like setting (see additional_data_merge.py for integration logic).

Data Preprocessing

To derive informative features from the raw order book snapshots, we first computed the Weighted Average Price (WAP) for each second, weighting each price by the volume on the opposite side of the book: \[ \text{WAP} = \frac{P_{bid}^{(1)}\cdot V_{ask}^{(1)} + P_{ask}^{(1)}\cdot V_{bid}^{(1)}}{V_{bid}^{(1)}+V_{ask}^{(1)}} \]

where \(P\) and \(V\) denote price and volume at Level 1, respectively.

Our target variable—realized volatility—was constructed by aggregating log returns over each time_id. Log returns were derived using the WAP, given by: \[ r_t = \log{(\text{WAP}_t)} - \log{(\text{WAP}_{t-1})} \]

Due to the non-sequential nature of time IDs, return-based calculations were constrained to within individual time_ids only. This design choice preserves the integrity of within-segment volatility dynamics and avoids false assumptions of temporal continuity.

Missing WAP values led to undefined returns; in such cases, we replaced NaN log returns with 0, effectively indicating no price movement. While simplistic, this imputation reflects the common assumption in high-frequency settings that missing ticks represent moments of inactivity. We acknowledge that this may understate micro-level volatility, but it ensures computational tractability and model compatibility.

Each time_id was divided into 20-second intervals, chosen to align with real-time decision horizons in high-frequency trading. Realized volatility over each 20-second window was computed as the square root of the sum of squared log returns: \[ RV_{[t,t+20]} = \sqrt{\sum_{i=t}^{t+19} r_i^2} \] This approach captures short-term market turbulence while smoothing out momentary spikes.

Finally, to ensure fair model training and evaluation, we verified that each of the selected stocks contained a sufficient number of unique time IDs, guaranteeing ample data for iteration and statistical robustness.

Code
def compute_wap(df):
    return (df['bid_price1'] * df['ask_size1'] +
            df['ask_price1'] * df['bid_size1']) / (df['bid_size1'] + df['ask_size1'])

def compute_log_returns(wap_series):
    return np.log(wap_series).diff().fillna(0)
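The 20-second realized-volatility target itself can be computed with a short helper; the function below is a sketch of our own (its name and windowing layout are ours), matching the RV formula above.

Code
```python
import numpy as np
import pandas as pd

def compute_realized_vol(log_returns, window=20):
    # Square root of the sum of squared log returns within each
    # non-overlapping `window`-second interval of a single time_id
    s = pd.Series(log_returns).reset_index(drop=True)
    rv = s.pow(2).groupby(s.index // window).sum().pow(0.5)
    return rv.values

# Example: constant 0.001 returns over 60 seconds -> three identical RV values
rv = compute_realized_vol(np.full(60, 0.001))
```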

Feature Engineering

To ensure a fair model comparison, we constructed a unified feature set using only Level 1 and Level 2 order book data, targeting early indicators of short-term volatility. Each feature class reflects a distinct mechanism driving price instability (full list in Appendix A).

1. Volatility Memory:

Rolling standard deviations of WAP log returns (1–5s) capture short-term persistence in return variability—crucial for anticipating clustered volatility.

2. Price Momentum and Directionality:

Lagged WAP values and first differences model emerging directional trends. A rolling trend estimate over 5s highlights sustained drifts preceding volatility spikes.

3. Microstructure Frictions:

Spread levels and deltas indicate rising execution risk and liquidity withdrawal—early signals of market stress.

4. Latent liquidity and Order Book Pressure:

Imbalance velocity, noise ratios, and volume-weighted volatility reveal hidden stress in order flow. These nonlinear signals are particularly effective for tree-based models.

Code
def add_return_features(df):
    for i in range(1, 6):
        df[f'log_ret_std_{i}'] = df['log_ret'].shift(1).rolling(window=i, min_periods=1).std()
    return df

def add_wap_features(df):
    for lag in range(1, 6):
        df[f'wap_lag_{lag}'] = df['WAP'].shift(lag)
        df[f'wap_delta_{lag}'] = df['WAP'] - df[f'wap_lag_{lag}']
    df['wap_trend_5s'] = df[[f'wap_delta_{i}' for i in range(1, 6)]].mean(axis=1)
    return df

def add_spread_features(df):
    df['spread'] = (df['ask_price1'] - df['bid_price1']) / df['ask_price1']
    for lag in range(1, 6):
        df[f'spread_lag_{lag}'] = df['spread'].shift(lag)
        df[f'spread_delta_{lag}'] = df['spread'] - df[f'spread_lag_{lag}']
    return df

def add_liquidity_features(df):
    df['liquidity_shock'] = df['bid_size1'].rolling(5).std() / df['bid_size1'].rolling(20).mean()
    df['imbalance_velocity'] = ((df['bid_size1'] - df['ask_size1']) / (df['bid_size1'] + df['ask_size1'])).diff().rolling(3).mean()
    df['vol_weighted_vol'] = df['log_ret'].abs() * df['bid_size1'].rolling(10).sum()
    df['noise_ratio'] = df['ask_price1'].diff().abs() / df['WAP'].diff().abs().rolling(5).std()
    df['hidden_liquidity'] = (df['bid_size2'] + df['ask_size2']) / (df['bid_size1'] + df['ask_size1'])
    return df

Model Candidates

In selecting our candidate models, we aimed to span a broad spectrum of volatility forecasting paradigms—comparing advanced machine learning to a traditional benchmark (EGARCH), with simpler models chosen to provide context for incremental gains and diagnostic insight.


Machine Learning Models: Capturing Nonlinear Microstructure Dynamics

  • LightGBM (Ke et al., 2017): A fast, memory-efficient gradient booster using leaf-wise growth and histogram-based binning—well-suited to high-dimensional, sparse order book signals.

  • XGBoost (Chen & Guestrin, 2016): A widely used gradient booster with level-wise tree construction and strong regularisation.

Both models target complex, transient patterns characteristic of short-horizon volatility.


Reference Models: Interpretability and Diagnostic Insight

  1. Random Forest (Breiman, 2001): An ensemble of decorrelated trees offering a non-boosted baseline. Its comparison to LightGBM and XGBoost isolates the contribution of boosting.

  2. Huber Regression: A robust linear model tested for its ability to extract stable relationships from noisy microstructure features.

Model Training and Testing

Machine Learning Models

We developed a time-consistent training pipeline tailored to high-frequency volatility prediction. It enforces no information leakage, supports lag-based features, and is robust to common pathologies in order book data.

Data Imputation and Preprocessing

Log and ratio transformations often produced infinities. We treated these as missing values, then applied forward- and back-filling within each time_id. Remaining gaps were filled with zero to preserve continuity without introducing noise.

Code
# Function that creates all features defined above at once 
def engineer_features(df):
    df = add_return_features(df)
    df = add_wap_features(df)
    df = add_spread_features(df)
    df = add_liquidity_features(df)
    df = df.replace([np.inf, -np.inf], np.nan)
    return df.ffill().bfill().fillna(0)

Time-Aware Splitting

Each time_id was split by seconds_in_bucket:

  • First 64% → training

  • Next 16% → validation

  • Final 20% → testing

The first 5 seconds were excluded from all sets to avoid lookahead in lagged features.

Code
# Function for splitting into train/val/test sets 
def split_data(df, feature_cols, target_col='log_ret'):
    df = df[df['seconds_in_bucket'] >= 5].reset_index(drop=True)
    train_val_mask = df['seconds_in_bucket'] <= 480
    test_mask = df['seconds_in_bucket'] > 480
    train_mask = df['seconds_in_bucket'] <= 480 * 0.8
    val_mask = (df['seconds_in_bucket'] > 480 * 0.8) & train_val_mask

    return (
        df[train_mask][feature_cols], df[train_mask][target_col],
        df[val_mask][feature_cols], df[val_mask][target_col],
        df[test_mask][feature_cols], df[test_mask][target_col],
        df[test_mask]
    )

Model Configuration

All tree-based models used shallow depths (max_depth=5) and conservative learning rates (0.03) to limit overfitting on noisy signals. Gradient boosting models were tuned with early stopping, while Huber regression used default robustness settings. All models followed a common parameter structure:

Code
# Consistent parameters to feed in each model, based on theory 
BASE_PARAMS = {
    "lightgbm": {
        "learning_rate": 0.03,
        "num_leaves": 31,
        "max_depth": 5,
        "min_data_in_leaf": 20,
        "feature_fraction": 0.8,
        "bagging_freq": 10,
        "verbosity": -1,
        "lambda_l2": 0
    },
    "xgboost": {
        'objective': 'reg:squarederror',
        'eval_metric': 'mae',
        "learning_rate": 0.03,
        "max_depth": 5,
        "min_child_weight": 5,
        "subsample": 0.8
    },
    "random_forest": {
        "max_depth": 5,
        "min_samples_leaf": 20,
        "n_estimators": 300
    },
    "huber": {
        "epsilon": 1.35, 
        "max_iter": 500
    }
}

During evaluation, each model predicted log returns over the final 120 seconds of each time_id, grouped into six 20-second windows. Realised volatility was computed per window and aggregated to evaluate model performance.
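This aggregation step can be sketched as follows (the array shapes and return values are illustrative placeholders, not the report's data): per-window realised volatility is computed from the final 120 seconds of predicted and actual log returns, then compared.

Code
```python
import numpy as np

def windowed_rv(returns, window=20):
    # Realised volatility per non-overlapping window: sqrt of summed squared returns
    r = np.asarray(returns)
    return np.sqrt((r ** 2).reshape(-1, window).sum(axis=1))

actual = np.full(120, 0.0010)      # placeholder per-second log returns
predicted = np.full(120, 0.0012)

rv_actual = windowed_rv(actual)    # six 20-second windows
rv_pred = windowed_rv(predicted)
rmse = np.sqrt(np.mean((rv_actual - rv_pred) ** 2))
```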

EGARCH Model

The EGARCH(1,1,1) model was trained on 80% of raw log returns and tested on the final 20%. Returns were scaled by \(10^4\) to improve numerical stability. The model specification included:

  • GARCH (\(1\)): Capturing volatility persistence.

  • ARCH (\(1\)): Accounting for immediate return shocks.

  • Asymmetric term (\(1\)): Leverage effects in volatility response.

    Warning

    Despite stabilisation efforts, convergence issues occasionally occurred—reflecting the fragility of parametric models on irregular, heavy-tailed high-frequency data.
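The specification can be illustrated with a minimal numpy sketch of the EGARCH(1,1,1) log-variance recursion; the parameter values below are illustrative placeholders, not fitted estimates, and the synthetic returns stand in for the scaled (\(\times 10^4\)) series described above.

Code
```python
import numpy as np

def egarch_variance(returns, omega=-0.1, alpha=0.1, gamma=-0.05, beta=0.95):
    r = np.asarray(returns, dtype=float)
    log_var = np.empty(len(r))
    log_var[0] = np.log(r.var())                     # initialise at sample variance
    for t in range(1, len(r)):
        z = r[t - 1] / np.exp(0.5 * log_var[t - 1])  # standardised shock
        log_var[t] = (omega
                      + alpha * (abs(z) - np.sqrt(2 / np.pi))  # ARCH term
                      + gamma * z                              # asymmetric (leverage) term
                      + beta * log_var[t - 1])                 # GARCH persistence term
    return np.exp(log_var)                           # variance positive by construction

rng = np.random.default_rng(0)
scaled_returns = rng.normal(scale=1e-4, size=600) * 1e4   # returns scaled by 10^4
sigma2 = egarch_variance(scaled_returns)
```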

Evaluation Metrics

Model performance was assessed using three key metrics: Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Directional Accuracy (MDA).

  • RMSE amplifies large errors, making it ideal for risk-aware applications where major mispredictions could lead to disruptive trades.

  • MAE offers a noise-tolerant alternative by averaging absolute deviations, better reflecting performance under typical high-frequency noise.

  • MDA evaluates whether the model correctly predicts the direction of volatility changes—crucial for anticipating market regime shifts and informing position management.

Code
def calculate_mda(actual, predicted):
    # Share of steps where the predicted change matches the sign of the actual change
    actual_signs = np.sign(np.diff(np.array(actual)))
    predicted_signs = np.sign(np.diff(np.array(predicted)))
    return np.sum(actual_signs == predicted_signs) / (len(actual) - 1)

Together, these metrics provide a balanced view of both error magnitude and predictive usefulness, especially in fast-moving trading environments.

Results

Our evaluation revealed consistent and substantial performance differences across forecasting models, with machine learning methods—particularly LightGBM—demonstrating superior ability to predict short-term volatility from high-frequency order book data. Figure 2 summarises predictive accuracy (RMSE, MAE), directional correctness (MDA), and inference latency, averaged across all time_ids in the test set.


Machine Learning Models Outperform Traditional Approaches

Among all models tested, LightGBM emerged as the top performer. It achieved the lowest RMSE (0.000091) and MAE (0.000066), representing over 99% error reduction compared to the EGARCH baseline. Its directional accuracy (MDA) reached 87.42%, confirming its ability to anticipate not just volatility magnitude but also direction—critical for actionable trading signals.

By contrast, EGARCH recorded the weakest performance: RMSE and MAE were orders of magnitude higher, and directional accuracy fell below chance (49.45%). This confirms its limitations in adapting to the non-stationary, fragmented nature of high-frequency data.


Tree-Based Models: LightGBM vs. XGBoost vs. Random Forest

Although XGBoost shares the same gradient-boosting lineage as LightGBM, it underperformed significantly, with 70% higher RMSE and 77% higher MAE. Architectural differences—particularly XGBoost’s level-wise tree growth—likely contributed to its inability to capture sharp, sparse order book patterns.

Random Forest offered strong predictive accuracy (RMSE: 0.000115; MDA: 81.64%) and robustness to noise, demonstrating the value of ensemble trees. However, its 0.904ms latency makes it impractical for real-time deployment.

In contrast, LightGBM balanced both accuracy and inference speed (0.157ms), making it the most suitable candidate for high-frequency environments.


Linear Baselines: Huber Regression and EGARCH

Huber regression, while extremely fast (0.050ms) and moderately robust (MDA: 70.85%), lacked the flexibility to model the nonlinear, transient dynamics present in order book data. Its performance confirms that linear models—even with robust loss functions—are inadequate for short-term volatility forecasting in high-frequency contexts.

As noted, EGARCH performed worst on all metrics, validating our hypothesis that traditional econometric models are ill-equipped to handle the microstructural complexities of real-time trading environments.

Code
from matplotlib.ticker import LogLocator

# Load data
final_metrics = pd.read_csv("model_results/final_metrics.csv")
avg_metrics_df = final_metrics.groupby("model")[["rmse", "mae", "mda", "prediction_time"]].mean().reset_index()

# Setup
model_order = ["egarch", "huber", "lightgbm", "random_forest", "xgboost"]
metrics = ["rmse", "mae"]
colors = {"rmse": "#f3e9d2", "mae": "#01426a", "mda": "#76c7c0"}

# Sort models
avg_metrics_df["model"] = pd.Categorical(avg_metrics_df["model"], categories=model_order, ordered=True)
avg_metrics_df = avg_metrics_df.sort_values("model")

# Create figure
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 8), gridspec_kw={'width_ratios': [3.5, 2]})

# --- LEFT: RMSE + MAE ---
x = np.arange(len(model_order))
width = 0.35
rmse_vals = avg_metrics_df["rmse"].values
mae_vals = avg_metrics_df["mae"].values

# Bars
bar1 = ax1.bar(x - width/2, rmse_vals, width, label='RMSE', color=colors["rmse"])
bar2 = ax1.bar(x + width/2, mae_vals, width, label='MAE', color=colors["mae"])

# Log scale and grid
ax1.set_yscale('log')
ax1.set_xticks(x)
ax1.set_xticklabels(model_order, rotation=15)
ax1.set_title("Model RMSE & MAE (log scale)", fontsize=14, pad=25)
ax1.set_ylabel("Score (log)")
ax1.set_xlabel("Model")
ax1.legend()
ax1.grid(True, axis='y', linestyle='--', linewidth=0.5)
ax1.grid(True, axis='x', linestyle='--', linewidth=0.5)

# Labels
for bars in [bar1, bar2]:
    for bar in bars:
        height = bar.get_height()
        ax1.text(bar.get_x() + bar.get_width()/2, height * 1.05,
                 f"{height:.2e}", ha='center', va='bottom', fontsize=8)

# Highlight best
totals = rmse_vals + mae_vals
best_idx = np.argmin(totals)
avg_y = (rmse_vals[best_idx] + mae_vals[best_idx]) / 2
ax1.scatter(best_idx, avg_y * 1.7, s=220, color="crimson", alpha=0.3, zorder=5)
ax1.scatter(best_idx, avg_y * 1.7, s=80, color="crimson", zorder=6, edgecolor='white', linewidth=1)
ax1.text(best_idx, avg_y * 2.2, "Best Overall", ha='center', va='bottom',
         fontsize=9, fontweight='bold', color="crimson", backgroundcolor="white")

# --- RIGHT: MDA ---
mda_vals = avg_metrics_df["mda"].values
bar3 = ax2.bar(x, mda_vals, width=0.5, color=colors["mda"])

ax2.set_xticks(x)
ax2.set_xticklabels(model_order, rotation=15)
ax2.set_title("Directional Accuracy (MDA)", fontsize=14, pad=25)
ax2.set_ylabel("MDA Score")
ax2.set_xlabel("Model")
ax2.grid(True, axis='y', linestyle='--', linewidth=0.5)
ax2.grid(True, axis='x', linestyle='--', linewidth=0.5)

# Labels
for bar in bar3:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2, height * 1.01,
             f"{height:.2%}", ha='center', va='bottom', fontsize=8)

# Highlight best
best_mda_idx = np.argmax(mda_vals)
ax2.scatter(best_mda_idx, mda_vals[best_mda_idx] * 1.1, s=220, color="crimson", alpha=0.3, zorder=5)
ax2.scatter(best_mda_idx, mda_vals[best_mda_idx] * 1.1, s=80, color="crimson", zorder=6, edgecolor='white', linewidth=1)
ax2.text(best_mda_idx, mda_vals[best_mda_idx] * 1.15, "Best MDA", ha='center', va='bottom',
         fontsize=9, fontweight='bold', color="crimson", backgroundcolor="white")

# Final layout
fig.tight_layout()
plt.show()
plt.close()

From a practical standpoint, LightGBM dominates the speed-accuracy trade-off, as shown in Figure 3. Its latency of 0.157ms sits comfortably within real-time trading constraints, while significantly outperforming all other models in predictive accuracy. XGBoost offers neither top-tier accuracy nor latency, while Random Forest’s strong accuracy is undercut by latency. Huber regression, though fastest, fails to capture the data’s complexity.

These results validate LightGBM as the most viable model for real-time volatility forecasting: fast, accurate, and responsive to the fleeting signals that drive short-term market risk.

Code
# Prepare data
df_time = avg_metrics_df[['model', 'prediction_time']].copy()
df_time = df_time.sort_values('prediction_time').reset_index(drop=True)  # sort numerically before formatting
egarch_time = df_time.loc[df_time['model'] == 'egarch', 'prediction_time'].values[0]
df_time['relative_to_egarch'] = (df_time['prediction_time'] / egarch_time).apply(lambda x: f"{x:.2f}x")
df_time['prediction_time'] = df_time['prediction_time'].apply(lambda x: f"{x:.3f}ms")

# Create figure
fig, ax = plt.subplots(figsize=(8, 3))
ax.axis('off')

# Create table
table = ax.table(
    cellText=df_time.values,
    colLabels=["Model", "Prediction Time", "Relative to EGARCH"],
    loc='center',
    cellLoc='center'
)

# Style table
table.auto_set_font_size(False)
table.set_fontsize(11)
table.scale(1, 1.5)  # Adjust cell padding

# Header style
for (row, col), cell in table.get_celld().items():
    if row == 0:  # Header row
        cell.set_text_props(weight='bold')
        cell.set_facecolor('#f3e9d2')  # Light gray header
    cell.set_edgecolor('#dddddd')  # Border color

plt.tight_layout()
plt.show()

Hyperparameter Tuning

To maximise LightGBM’s real-world performance and robustness, we conducted systematic hyperparameter tuning using a time-aware optimisation pipeline. Our aim was not just to enhance accuracy, but to ensure generalisation across the abrupt regime shifts common in high-frequency markets.

To achieve this, we developed a robust and time-aware tuning pipeline combining three core components:

  1. Temporal Integrity with TimeSeriesSplit Cross Validation

We employed TimeSeriesSplit cross-validation to preserve chronological structure and avoid lookahead bias. Each training fold strictly preceded its corresponding validation set, replicating realistic forecasting conditions. This setup ensured that performance gains were genuine and generalisable—not artefacts of data leakage.

  2. Efficient Hyperparameter Optimisation with Optuna

Hyperparameters were tuned using Optuna, an efficient Bayesian optimisation framework. Compared to traditional grid or random search, Optuna accelerated convergence and reduced computational cost via:

Note
  • Adaptive learning that targeted high-performing regions,

  • Early stopping for poor-performing trials, and

  • Parallel execution across trials.

This allowed broader, deeper exploration of LightGBM’s parameter space without compromising evaluation rigor.

  3. Stable Evaluation via RMSE Averaging

Performance during tuning was assessed using average RMSE across five sequential folds. This smoothed out the impact of volatile market intervals and provided a stable metric for guiding the search process.
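The chronological guarantee in point 1 is easy to verify with a quick illustrative check: every validation index produced by TimeSeriesSplit strictly follows the last training index.

Code
```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(25).reshape(-1, 1)          # 25 time-ordered observations
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # each validation fold starts after its training fold ends
    assert train_idx.max() < val_idx.min()
```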

Code
# LightGBM model 
df_general = pd.read_csv("vol_subsets/df_general.csv")

feature_cols = [f'log_ret_std_{i}' for i in range(1, 6)] + \
    [f'wap_lag_{i}' for i in range(1, 6)] + \
    [f'wap_delta_{i}' for i in range(1, 6)] + \
    ['wap_trend_5s', 'spread'] + \
    [f'spread_lag_{i}' for i in range(1, 6)] + \
    [f'spread_delta_{i}' for i in range(1, 6)] + \
    ['liquidity_shock', 'imbalance_velocity', 'vol_weighted_vol', 'noise_ratio', 'hidden_liquidity']

X_all, y_all = [], []
for (stock, time_id), df_subset in df_general.groupby(['stock_name', 'time_id']):
    df = df_subset.copy()
    df['WAP'] = compute_wap(df)
    df['log_ret'] = compute_log_returns(df['WAP'])
    df = engineer_features(df)

    X_train, y_train, *_ = split_data(df, feature_cols)
    X_all.append(X_train)
    y_all.append(y_train)

X_train_all = pd.concat(X_all).sort_index()
y_train_all = pd.concat(y_all).sort_index()

# TimeseriesSplit with optuna
def objective(trial):
    params = {
        'objective': 'regression',
        'metric': 'rmse',
        'verbosity': -1,
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_categorical('num_leaves', [20, 31, 40, 50]),
        'max_depth': trial.suggest_int('max_depth', 3, 7),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.05, step=0.01),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.7, 1.0, step=0.1),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.7, 1.0, step=0.1),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 20, 60, step=10),
        'lambda_l1': trial.suggest_float('lambda_l1', 0.0, 1.0, step=0.1),
        'lambda_l2': trial.suggest_float('lambda_l2', 0.0, 1.0, step=0.1),
    }

    tscv = TimeSeriesSplit(n_splits=5)
    scores = []
    for train_idx, val_idx in tscv.split(X_train_all):
        X_t, X_v = X_train_all.iloc[train_idx], X_train_all.iloc[val_idx]
        y_t, y_v = y_train_all.iloc[train_idx], y_train_all.iloc[val_idx]

        lgb_train = lgb.Dataset(X_t, y_t)
        lgb_valid = lgb.Dataset(X_v, y_v, reference=lgb_train)

        model = lgb.train(
            params,
            lgb_train,
            valid_sets=[lgb_valid],
            num_boost_round=300,
            callbacks=[lgb.early_stopping(20), lgb.log_evaluation(0)])

        y_pred = model.predict(X_v, num_iteration=model.best_iteration)
        score = mean_squared_error(y_v, y_pred)
        scores.append(score)

    return np.mean(scores)

Feature Selection

We applied SHAP (SHapley Additive Explanations) to the optimised LightGBM model to assess feature contributions across the test set. Figure 4 displays the top ten features ranked by mean absolute SHAP value.


Microstructure Features Dominate

The model’s predictions were overwhelmingly driven by ultra-recent WAP (weighted average price) differentials, particularly wap_delta_1, which had a SHAP impact over 10× higher than any other feature. This confirms our hypothesis: short-term volatility is governed by immediate order book imbalances, not lagged returns.

Successive wap_delta_k features (2–5) were also influential, indicating that the model captures fine-grained price dynamics akin to momentum and acceleration at millisecond resolution—critical for forecasting volatility spikes.


Liquidity and Noise Signals

Features such as vol_weighted_vol and noise_ratio contributed meaningfully, reflecting liquidity-sensitive volatility and the destabilising effect of conflicting quote signals. These reinforce the role of order flow quality in anticipating abrupt market movements.


Supporting Context

Lower-ranked features, including wap_lag_1 and spread_delta_1, offered contextual information but had substantially less influence. Their limited contribution underscores a key result: effective volatility prediction in high-frequency settings depends primarily on instantaneous, not historical, signals.

Code
shap_mean = pd.read_csv("shap_mean_importance.csv", index_col=0, header=None)
shap_mean.columns = ['mean_abs_shap']
top_10_df = shap_mean.sort_values(by='mean_abs_shap', ascending=False).head(10)
top_10_df = top_10_df[::-1]

# Plot SHAP values
fig, ax = plt.subplots(figsize=(8, 6))
bars = ax.barh(top_10_df.index, top_10_df['mean_abs_shap'], color="#003366")

# Add text at the end of bars
offset = top_10_df['mean_abs_shap'].max() * 0.02
for bar, value in zip(bars, top_10_df['mean_abs_shap']):
    ax.text(bar.get_width() + offset, bar.get_y() + bar.get_height()/2,
            f"{value:.2e}", va='center', ha='left', fontsize=10)

# Style
ax.set_title('Top 10 Most Important Features (SHAP)', fontsize=16, weight='bold')
ax.set_xlabel('Mean |SHAP value|', fontsize=12)
ax.set_ylabel('Feature', fontsize=12)
ax.set_facecolor('white')
ax.grid(axis='x', linestyle='--', color='lightgray')
ax.spines[['top', 'right']].set_visible(False)
fig.tight_layout()
plt.show()

Robustness Check and Generalisability

To assess generalisation, we evaluated LightGBM on the disjoint high- and low-volatility subsets (df_high, df_low) constructed during data collection.

The model performed strongly on df_low, exceeding its full-sample baseline in both RMSE and MAE—demonstrating effective generalisation under stable conditions. In contrast, performance declined on df_high, reflecting the inherent difficulty of capturing sharp, irregular volatility spikes using static-tree models optimised on broader market conditions. (See Appendix for metrics)

Despite this drop, LightGBM retained operational utility, with results supporting its use as a core forecasting module in risk-aware trading systems operating in low- to moderate-volatility regimes.

EchoVol20 App Deployment